Nature Computational Science
Springer Science and Business Media LLC
Preprints posted in the last 7 days, ranked by how well they match Nature Computational Science's content profile, based on 50 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Muller, B.; Ortiz Barranon, A. A.; Roberts, L.
Dysarthric speech severity assessment typically requires either trained clinicians or supervised machine learning models built from labelled pathological speech data, limiting scalability across languages and clinical settings. We present a training-free method (no supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner) that quantifies dysarthria severity by measuring the degradation of phonological feature subspaces within frozen HuBERT representations. For each speaker, we extract phone-level embeddings via the Montreal Forced Aligner, compute d scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy control speech, and construct a 12-dimensional phonological profile. Evaluating 890 speakers across 10 corpora, 5 languages for the full MFA pipeline (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis), we find that all five consonant d features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2 x 10^-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero), with the effect replicating within individual corpora, surviving FDR correction, and remaining robust to leave-one-corpus-out removal and alignment quality controls. Nasality d decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001). The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages) or a model trained from healthy speech alone. It produces clinically interpretable per-feature profiles. We release the full pipeline and phone feature configurations for six languages to support replication and clinical adoption.

Author Summary: One of the authors has lived with ALS for sixteen years. Bernard Muller, who built this entire analytical pipeline using only eye-tracking technology, has experienced the progression of the disease firsthand, including the dysarthric speech that comes with advancing ALS and the tracheostomy that followed. The problem this paper addresses is not abstract to him, and that shapes how the method was designed. We developed a method to measure how well a person with dysarthria can produce distinct speech sounds, without needing any recordings of disordered speech for training. Our approach works by analysing how a widely available AI speech model organises different sound categories -- such as nasal versus oral consonants, or voiced versus voiceless sounds -- and measuring whether those categories become harder to tell apart. We tested this on 890 speakers across 10 datasets in five languages, covering Parkinson's disease, cerebral palsy, and ALS. Because the method only needs healthy speech recordings to set up, it applies to any language with an existing acoustic model, currently covering 29 languages. The resulting profiles show clinicians which specific aspects of speech production are degrading, rather than providing a single opaque severity score. This could support remote monitoring of speech decline in neurodegenerative disease and enable screening in languages and settings where specialist assessment is unavailable.
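To make the core measurement concrete, the sketch below computes a d-prime-style separability score along a contrast direction estimated from healthy-control embeddings only, which is one plausible reading of the abstract's "d scores"; the function name and estimator details are illustrative assumptions, not the authors' released pipeline.

```python
import numpy as np

def dprime_along_contrast(healthy_a, healthy_b, test_a, test_b):
    """Separation of two phone classes (e.g. nasal vs. oral) for one
    test speaker, measured along a direction fixed by healthy controls.

    All inputs are (n, dim) arrays of frozen HuBERT phone-level
    embeddings; healthy_* come from control speakers only.
    """
    # Contrast direction: difference of healthy class means, unit norm.
    w = healthy_a.mean(axis=0) - healthy_b.mean(axis=0)
    w /= np.linalg.norm(w)

    # Project the test speaker's embeddings onto the healthy direction.
    pa, pb = test_a @ w, test_b @ w

    # d-prime: mean separation over the pooled standard deviation.
    pooled_sd = np.sqrt(0.5 * (pa.var(ddof=1) + pb.var(ddof=1)))
    return abs(pa.mean() - pb.mean()) / pooled_sd
```

Repeating this over the phonological contrasts would yield the 12-dimensional per-speaker profile, with lower scores indicating a more degraded contrast.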
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background. Federated learning (FL) enables collaborative model training across institutions without sharing patient-level data. However, standard FL algorithms such as FedAvg degrade under non-independently and non-identically distributed (non-IID) data, a prevalent condition when patient demographics, scanner hardware, and disease prevalence differ across hospital sites. Objective. We propose iPS-MFFL (Individualized Per-Site Meta-Federated Feature Learning), a federated framework with a hierarchical local-model architecture that addresses non-IID heterogeneity through (1) a shared feature extractor, (2) multiple weak-learner classification heads that can be trained with heterogeneous training objectives to promote complementary decision boundaries, (3) independent per-learner server aggregation so that each weak learner's parameters are averaged only with its counterparts at other clients, and (4) a lightweight meta-model, itself federated, that adaptively stacks the weak-learner outputs. Methods. We evaluate on the Brain Tumor MRI Classification dataset (7,200 images; 4 classes: glioma, meningioma, pituitary tumor, no tumor) partitioned across K = 5 simulated hospital sites using Dirichlet non-IID sampling (alpha = 0.3). Four baselines are compared: Local-only training, FedAvg, FedProx, and Freeze-FT. All experiments are repeated over three random seeds (13, 42, 2025) and evaluated using paired t-tests, Cohen's d effect sizes, and post-hoc power analysis.
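The aggregation step (point 3) is the framework's main departure from plain FedAvg; below is a minimal sketch of what per-learner averaging could look like, assuming a flat dict of named parameter arrays per client (the names and data layout are hypothetical, not taken from the paper).

```python
import numpy as np

def per_learner_fedavg(client_states, client_sizes):
    """Average each named component (shared extractor, each weak-learner
    head, meta-model) only with its counterparts at other clients.

    client_states: one dict per client mapping component name -> array
    client_sizes:  samples per client, used as FedAvg weights
    """
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    global_state = {}
    for name in client_states[0]:
        # head_3 at client A is only ever mixed with head_3 elsewhere,
        # so the weak learners keep complementary decision boundaries.
        stacked = np.stack([state[name] for state in client_states])
        global_state[name] = np.tensordot(weights, stacked, axes=1)
    return global_state
```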
Smah, M. L.; Seale, A. C.; Rock, K. S.
Network-based epidemic models have been instrumental in understanding how contact structure shapes infectious disease dynamics, yet widely used frameworks such as Erdős-Rényi, configuration-model, and stochastic block networks do not explicitly capture the combination of fully accessible (saturated) within-group interactions and constrained between-group connectivity characteristic of many real-world settings. Here, we introduce the Multi-Clique (MC) network model, a generative framework in which individuals are organised into fully connected cliques representing stable contact groups (e.g., households, classrooms, or workplaces), with a limited number of external connections governing inter-group transmission. Using stochastic susceptible-infectious-recovered (SIR) simulations on degree-matched networks, we compare epidemic dynamics on MC networks with those on classical random graph models. Despite having an identical mean degree, MC networks exhibit systematically distinct behaviour, including slower epidemic growth, reduced peak prevalence, increased fade-out probability, and delayed time to peak. These effects arise from rapid within-clique but constrained between-clique transmission, creating structural bottlenecks that standard models do not capture. The MC framework provides an interpretable, data-driven representation of recurrent contact structure, with parameters that map directly to observable quantities such as household and classroom sizes. By isolating the role of intergroup connectivity, the model offers a basis for evaluating targeted intervention strategies that reduce between-group mixing while preserving within-group interactions. Our results highlight the importance of explicitly representing the real-life clique-based network structure in epidemic models and suggest that classical degree-matched networks may systematically overestimate epidemic speed and intensity in structured populations.
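A minimal generative sketch of the MC construction, assuming the simplest parameterisation (a total count of external edges placed uniformly between cliques; the paper may parameterise inter-group connectivity differently):

```python
import random
import networkx as nx

def multi_clique_network(clique_sizes, n_external, seed=0):
    """Fully connected cliques (households, classrooms, ...) joined by
    a limited number of randomly placed between-clique edges."""
    rng = random.Random(seed)
    G = nx.Graph()
    cliques, start = [], 0
    for size in clique_sizes:
        members = list(range(start, start + size))
        G.add_nodes_from(members)
        G.add_edges_from(nx.complete_graph(members).edges)
        cliques.append(members)
        start += size
    added = 0
    while added < n_external:  # assumes n_external is feasibly small
        c1, c2 = rng.sample(range(len(cliques)), 2)
        u, v = rng.choice(cliques[c1]), rng.choice(cliques[c2])
        if not G.has_edge(u, v):
            G.add_edge(u, v)
            added += 1
    return G

# e.g. 100 households of size 4 with 50 between-group links
G = multi_clique_network([4] * 100, n_external=50)
```

Running a stochastic SIR process on G versus a degree-matched Erdős-Rényi graph is then what exposes the slower growth and higher fade-out probability described above.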
Prestige, E.; Warren-Gash, C.; Quint, J. K.; Evans, D.; Costello, R. E.; Mehrkar, A.; Bacon, S.; Goldacre, B.; Barley-McMullen, S.; Yameen, F.; Shah, P.; Natt, M.; Alder, Y.; Hulme, W.; Parker, E. P. K.; Eggo, R. M.
Electronic health records (EHRs) are a rich source of data that can be used to analyse health outcomes using computable phenotypes. With the approval of NHS England we used the OpenSAFELY secure analytics platform to design and assess phenotypes to classify three key respiratory viruses - respiratory syncytial virus (RSV), influenza, and COVID-19 - in English coded health data between September 2016 and August 2024. We compared specific and sensitive phenotypes to one another and to publicly available surveillance data. Cases from both phenotypes showed similar seasonal patterns to surveillance data. Sensitive phenotypes carried a higher risk of misclassification than specific phenotypes for mild cases. For severe cases the risk of misclassification was higher in infants than for older adults, irrespective of the phenotype used. The phenotypes presented here offer a solution to classifying respiratory viruses from coded health records in the absence of testing information.
Wang, X.; Hammarlund, N.; Prosperi, M.; Zhu, Y.; Revere, L.
Automating Hierarchical Condition Category (HCC) assignment directly from unstructured electronic health record (EHR) notes remains an important but understudied problem in clinical informatics. We present HCC-Coder, an end-to-end NLP system that maps narrative documentation to 115 Centers for Medicare & Medicaid Services (CMS) HCC codes in a multi-label setting. On the test dataset, HCC-Coder achieves a macro-F1 of 0.779 and a micro-F1 of 0.756, with a macro-sensitivity of 0.819 and macro-specificity of 0.998. By contrast, Generative Pre-trained Transformer (GPT)-4o achieves at best a macro-F1 of 0.735 and a micro-F1 of 0.708 under five-shot prompting. The fine-tuned model demonstrates consistent absolute improvements of 4%-5% in F1-scores over GPT-4o. To address severe label imbalance, we incorporate inverse-frequency weighting and per-label threshold calibration. These findings suggest that domain-adapted transformers provide more balanced and reliable performance than prompt-based large language models for hierarchical clinical coding and risk adjustment.
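The two imbalance fixes named above are simple to state precisely; here is a hedged sketch of both, with hypothetical array shapes (the paper's exact recipe, e.g. its threshold grid or weight normalisation, may differ):

```python
import numpy as np

def inverse_frequency_weights(Y):
    """Per-label positive-class weights for a multi-label target matrix
    Y of shape (n_samples, n_labels); rarer HCC codes get larger weights."""
    pos_freq = Y.mean(axis=0).clip(min=1e-6)
    w = 1.0 / pos_freq
    return w / w.mean()  # normalise so the average weight is 1

def calibrate_thresholds(probs, Y, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, for each label, the decision threshold that maximises F1
    on a validation split."""
    thresholds = np.empty(Y.shape[1])
    for j in range(Y.shape[1]):
        best_f1, best_t = -1.0, 0.5
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (Y[:, j] == 1))
            fp = np.sum(pred & (Y[:, j] == 0))
            fn = np.sum(~pred & (Y[:, j] == 1))
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds
```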
Ben-Joseph, J.
Lightweight epidemic calculators are widely used for teaching and rapid scenario exploration, yet many omit the methodological detail needed for scientific reuse. We present a browser-native SIR calculator that exposes forward Euler and classical fourth-order Runge-Kutta (RK4) integration alongside epidemiologically interpretable outputs and a population-conservation diagnostic. The implementation is anchored to analytical properties of the deterministic SIR system, including the epidemic threshold, the peak condition, and the final-size relation. Benchmark experiments show that RK4 is essentially step-size invariant over practical discretizations, whereas Euler at a coarse one-day step overestimates peak prevalence by 3.97% and final size by 0.66% relative to a fine-step RK4 reference. These results demonstrate that browser-based tools can support publication-quality computational narratives when solver choice, diagnostics, and assumptions are treated as first-class outputs.
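The solver comparison at the heart of the benchmark is easy to reproduce; here is a self-contained sketch with illustrative parameter values (beta, gamma, and the step sizes below are assumptions, not the calculator's defaults):

```python
import numpy as np

def sir_rhs(state, beta, gamma):
    """Deterministic SIR right-hand side, with S, I, R as fractions."""
    s, i, r = state
    return np.array([-beta * s * i, beta * s * i - gamma * i, gamma * i])

def integrate(state, beta, gamma, h, days, method="rk4"):
    """Forward Euler vs. classical fourth-order Runge-Kutta."""
    traj = [state]
    for _ in range(int(days / h)):
        if method == "euler":
            state = state + h * sir_rhs(state, beta, gamma)
        else:
            k1 = sir_rhs(state, beta, gamma)
            k2 = sir_rhs(state + 0.5 * h * k1, beta, gamma)
            k3 = sir_rhs(state + 0.5 * h * k2, beta, gamma)
            k4 = sir_rhs(state + h * k3, beta, gamma)
            state = state + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(state)
    return np.array(traj)

x0 = np.array([0.999, 0.001, 0.0])
euler = integrate(x0, beta=0.3, gamma=0.1, h=1.0, days=200, method="euler")
rk4 = integrate(x0, beta=0.3, gamma=0.1, h=0.1, days=200)
# Coarse-step Euler overshoots the epidemic peak relative to RK4;
# S + I + R staying at 1 is the population-conservation diagnostic.
print(euler[:, 1].max(), rk4[:, 1].max(), euler[-1].sum())
```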
Feng, Y.; Deng, K.; Guan, Y.
Gene networks (GNs) encode diverse molecular relationships and are central to interpreting cellular function and disease. The heterogeneity of interaction types has led to computational methods specialized for particular network contexts. Large language models (LLMs) offer a unified, language-based formulation of GN inference by leveraging biological knowledge from large-scale text corpora, yet their effectiveness remains sensitive to prompt design. Here, we introduce Gene-Relation Adaptive Soft Prompt (GRASP), a parameter-efficient and trainable framework that conditions inference on each gene pair through only three virtual tokens. Using factorized gene-specific and relation-aware components, GRASP learns to map each pair's biological context into compact soft prompts that combine pair-specific signals with shared interaction patterns. Across diverse GN inference tasks, GRASP consistently outperforms alternative prompting strategies. It also shows a stronger ability to recover unannotated interactions from synthetic negative sets, suggesting its capacity to identify biologically meaningful relationships beyond existing databases. Together, these results establish GRASP as a scalable and generalizable prompting framework for LLM-based GN inference.
Mboya, G. O.
Machine learning models trained on observational data from one environment frequently fail when deployed in another, because standard learning algorithms exploit spurious correlations alongside causal ones. Invariant learning methods address this problem by seeking representations that support stable prediction across training environments, but their behavior on tabular data remains poorly characterized. We present CausTab, a gradient variance regularization framework for causal invariant representation learning on mixed tabular data. CausTab penalizes the variance of parameter gradients across training environments, providing a richer invariance signal than the scalar penalty used by Invariant Risk Minimization (IRM). We provide formal results showing that the gradient variance penalty is zero at causally invariant solutions and positive at solutions that rely on spurious features. Through experiments on synthetic data across three spurious-correlation regimes, four cycles of the National Health and Nutrition Examination Survey (NHANES), and four hospital systems in the UCI Heart Disease dataset, we demonstrate that: (1) IRM consistently degrades relative to standard empirical risk minimization (ERM) on tabular data, losing up to 13.8 AUC points in spurious-dominant settings, a failure we trace mechanistically to penalty collapse during training; (2) CausTab matches or exceeds ERM in every experimental condition; (3) CausTab achieves consistently better probability calibration than both ERM and IRM; and (4) invariant learning methods fail when environments differ in outcome prevalence rather than in spurious feature correlations, a boundary condition we characterize both empirically and theoretically. We introduce the Spurious Dominance Index (SDI), a practical scalar diagnostic for determining whether a dataset requires invariant learning, and validate it across all experimental settings.
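For intuition about the penalty itself, here is a minimal sketch for logistic regression (CausTab operates on learned representations of mixed tabular data; this scalar-model version is an illustrative reduction, not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_variance_penalty(w, envs):
    """Variance of per-environment log-loss gradients.

    envs: list of (X, y) pairs, one per training environment. At a
    causally invariant w every environment asks for the same update,
    so the cross-environment gradient variance is ~0; reliance on
    spurious features makes the environments disagree.
    """
    grads = []
    for X, y in envs:
        # Gradient of the mean logistic loss for this environment.
        grads.append(X.T @ (sigmoid(X @ w) - y) / len(y))
    G = np.stack(grads)            # (n_envs, n_features)
    return G.var(axis=0).sum()     # total gradient variance across envs

# A training objective would then be pooled_loss(w) + lam * penalty(w),
# richer than IRM's single scalar penalty because disagreement is
# measured coordinate-wise across the whole gradient.
```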
Mille-Fragoso, L. S.; Driscoll, C. L.; Wang, J. N.; Dai, H.; Widatalla, T. M.; Zhang, J. L.; Zhang, X.; Rao, B.; Feng, L.; Hie, B. L.; Gao, X. J.
Obtaining novel antibodies against specific protein targets is a widely important yet experimentally laborious process. Meanwhile, computational methods for antibody design have been limited by low success rates that currently require resource-intensive screening. Here, we introduce Germinal, a broadly enabling generative pipeline that designs antibodies against specific epitopes with nanomolar binding affinities while requiring only low-n experimental testing. Our method co-optimizes antibody structure and sequence by integrating a structure predictor with an antibody-specific protein language model to perform de novo design of functional complementarity-determining regions (CDRs) onto a user-specified structural framework. When tested against four diverse protein targets, Germinal successfully designed functional antibodies across all targets and binder formats, testing only 43-101 designs for each antigen. Validated designs also exhibited robust expression in mammalian cells and high sequence and structural novelty. We provide open-source code and full computational and experimental protocols to facilitate wide adoption. Germinal represents a milestone in efficient, epitope-targeted de novo antibody design, with notable implications for the development of molecular tools and therapeutics.
Roca, M.; Messuti, G.; Klepachevskyi, D.; Angiolelli, M.; Bonavita, S.; Trojsi, F.; Demuru, M.; Troisi Lopez, E.; Chevallier, S.; Yger, F.; Saudargiene, A.; Sorrentino, P.; Corsi, M.-C.
Neurodegenerative diseases such as Mild Cognitive Impairment (MCI), Multiple Sclerosis (MS), Parkinson's Disease (PD), and Amyotrophic Lateral Sclerosis (ALS) are becoming more prevalent. Each of these diseases, despite its specific pathophysiological mechanisms, leads to widespread reorganization of brain activity. However, the corresponding neurophysiological signatures of these changes have been elusive. As a consequence, to date, it is not possible to effectively distinguish these diseases from neurophysiological data alone. This work uses Magnetoencephalography (MEG) resting-state data, combined with interpretable machine learning techniques, to support differential diagnosis. We expand on previous work and design a Riemannian geometry-based classification pipeline. The pipeline is fed with typical connectivity metrics, such as covariance or correlation matrices. To maintain interpretability while reducing feature dimensionality, we introduce a classifier-independent feature selection procedure that uses effect sizes derived from the Kruskal-Wallis test. The ensemble classification pipeline, called REDDI, achieved a mean balanced accuracy of 0.81 (+/-0.04) across five folds, representing a 13% improvement over the state-of-the-art, while remaining clinically transparent. As such, our approach provides a reliable, interpretable, data-driven, operator-independent decision-support tool for Neurology.
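The feature-selection step is fully specified by standard statistics; here is a sketch under the common eta-squared approximation for the Kruskal-Wallis H (the paper's exact effect-size definition and cut-off are not given in the abstract):

```python
import numpy as np
from scipy.stats import kruskal

def kruskal_effect_sizes(features, labels):
    """Classifier-independent ranking of connectivity-derived features.

    features: (n_subjects, n_features) array
    labels:   group label per subject (e.g. MCI, MS, PD, ALS, control)
    Uses eta^2 = (H - k + 1) / (n - k), a standard effect-size
    approximation for the Kruskal-Wallis statistic.
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, k = len(labels), len(groups)
    effects = np.empty(features.shape[1])
    for j in range(features.shape[1]):
        samples = [features[labels == g, j] for g in groups]
        H, _ = kruskal(*samples)
        effects[j] = (H - k + 1) / (n - k)
    return effects  # keep the top-ranked features before classification
```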
Sooknah, M.; Srinivasan, R.; Sankarapandian, S.; Chen, Z.; Xu, J.
Genome-wide association studies (GWAS) have transformed our understanding of human biology, but are constrained by the need for predefined phenotypes. We introduce Vector2Variant (V2V), a general-purpose framework that transforms any set of high-dimensional measurements (such as machine learning embeddings) into a genome-wide scan for associations, without requiring rigid specification of a phenotype. Rather than testing genetic variants against single traits, V2V finds the axis in multivariate space along which carriers and non-carriers maximally differ, and produces a continuous "projection phenotype" that can be interpreted by association with disease labels. The projection phenotypes correlate with orthogonal clinical biomarkers never seen during training, suggesting the learned axes capture biologically meaningful variation. We applied V2V to imaging, time-series, and omics modalities in the UK Biobank and recovered established biology (like the role of CASP9 in renal failure) without the need for targeted measurements, alongside novel associations including a frameshift variant in LRRIQ1 (potentially protective for cardiovascular disease). V2V is computationally efficient at genome-wide scale, producing summary statistics and disease associations that facilitate target prioritization without the need for phenotype engineering.
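The "axis of maximal difference" admits a compact classical reading; the sketch below uses a Fisher/LDA direction as a stand-in (V2V's actual estimator and regularisation are not specified in the abstract):

```python
import numpy as np

def projection_phenotype(E, carrier, ridge=1e-3):
    """Axis along which carriers and non-carriers maximally differ,
    then a continuous per-individual projection onto it.

    E:       (n_individuals, dim) matrix of embeddings
    carrier: boolean mask marking carriers of one variant
    """
    mu1, mu0 = E[carrier].mean(axis=0), E[~carrier].mean(axis=0)
    # Pooled within-group covariance, ridge-regularised for stability.
    Xc = np.vstack([E[carrier] - mu1, E[~carrier] - mu0])
    S = Xc.T @ Xc / (len(E) - 2) + ridge * np.eye(E.shape[1])
    w = np.linalg.solve(S, mu1 - mu0)
    w /= np.linalg.norm(w)
    return E @ w  # the projection phenotype, testable against diseases
```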
Pinero, S. L.; Li, X.; Lee, S. H.; Liu, L.; Li, J.; Le, T. D.
Long COVID affects millions of people worldwide, yet no disease-modifying treatment has been approved, and existing interventions have shown only modest and inconsistent benefits. A key reason for this limited progress is that current computational drug repurposing pipelines do not match well with the clinical reality of Long COVID. These patients often have persistent, multisystemic symptoms and may already be taking multiple medications, making treatment safety a primary concern. However, most repurposing workflows still treat safety as a downstream filter and rely on disease-associated targets rather than causal drivers. They also assume that the findings of one analysis would generalize across the diverse presentations of Long COVID. We introduce SPLIT, a safety-first repurposing framework that addresses these limitations. SPLIT prioritizes safety at the start of the candidate evaluation, integrates complementary causal inference strategies to identify likely driver genes, and uses a counterfactual substitution design to compare drugs within specific cohort contexts. When applied to cognitive and respiratory Long COVID cohorts, SPLIT revealed three main findings. First, drugs with similar predicted efficacy could have very different predicted safety profiles. Second, the drugs flagged as unfavorable were often different between the two cohorts, showing that drug prioritization is phenotype-specific. Third, SPLIT flagged 18 drugs currently under active investigation in Long COVID trials as having unfavorable predicted profiles. SPLIT provides a practical framework to identify safer, more context-appropriate candidates earlier in the process, supporting more targeted and better-tolerated treatment strategies for Long COVID.
Bhansali, R.; Gorenshtein, A.; Westover, B.; Goldenholz, D. M.
Manuscript preparation is a critical bottleneck in scientific publishing, yet existing AI writing tools require cloud transmission of sensitive content, creating data-confidentiality barriers for clinical researchers. We introduce the Paper Analysis Tool (PAT), a free, multi-agent framework that deploys 31 specialized agents powered by small language models (SLMs) to audit manuscripts across multiple quality dimensions without external data transmission. Applied to three published clinical neurological papers, PAT generated 540 evaluable suggestions. Independent validation by two expert reviewers (R.B., A.G.) confirmed 391 actionable, high-value revisions (90% agreement), achieving a 72.4% overall usefulness accuracy spanning methodological, statistical, and visual domains. Furthermore, deterministic re-evaluation of 126 agent-suggested rewrite pairs using Phase 0 metrics confirmed text improvement: total word count decreased by 25%, passive voice prevalence dropped sharply from 35% to 5%, average sentence length decreased by 24%, long-sentence fraction fell by 67%, and the Flesch-Kincaid grade improved by 17%. Our validation confirms that systematic, agent-driven pre-submission review drives measurable improvements, successfully converting manuscript optimization from an opaque, manual endeavor into a transparent and rigorous scientific process.
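The "Phase 0" re-evaluation relies on deterministic text statistics; here is a sketch of the kind of metrics involved, using the published Flesch-Kincaid grade formula (0.39 * words-per-sentence + 11.8 * syllables-per-word - 15.59) and a crude syllable heuristic; PAT's actual implementation may differ in detail.

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.I)

def count_syllables(word):
    """Crude syllable estimate: vowel groups, minus a trailing silent e."""
    word = word.lower()
    n = len(VOWEL_GROUPS.findall(word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def phase0_metrics(text):
    """Word count, average sentence length, and Flesch-Kincaid grade."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)
    fk = 0.39 * wps + 11.8 * (syllables / max(len(words), 1)) - 15.59
    return {"words": len(words), "avg_sentence_len": wps, "fk_grade": fk}
```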
Nguyen, T. M.; Woods, C.; Liu, J.; Wang, C.; Lin, A.-L.; Cheng, J.
The apolipoprotein E ε4 (APOE4) allele is the strongest genetic risk factor for late-onset Alzheimer's disease (AD), the most common form of dementia. APOE4 carriers exhibit cerebrovascular and metabolic dysfunction, structural brain alterations, and gut microbiome changes decades before the onset of clinical symptoms. A better understanding of the early manifestation of these physiological changes is critical for the development of timely AD interventions and risk reduction protocols. Multimodal datasets encompassing a wide range of APOE4- and AD-associated biomarkers provide a valuable opportunity to gain insight into the APOE4 phenotype; however, these datasets often present analytical challenges due to small sample sizes and high heterogeneity. Here, we propose a two-stage multimodal AI model (APOEFormer) that integrates blood metabolites, brain vascular and structural MRI, microbiome profiles, and other clinical and demographic data to predict APOE4 allele status. In the first stage, modality-specific encoders are used to generate initial representations of input data modalities, which are aligned in a shared latent space via self-supervised contrastive learning during pretraining. This objective encourages the learning of informative and consistent representations across modalities by leveraging cross-modality relationships. In the second stage, the pretrained representations are used as inputs to a multimodal transformer that integrates information across modalities to predict a key AD risk genetic variant (APOE4). Across 10 independent experimental runs with different train-validation-test splits, APOEFormer predicts whether an individual carries an APOE4 allele with an average accuracy of 75%, demonstrating robust performance under limited sample sizes. Post hoc perturbation analysis of the predictive model revealed valuable insights into the driving components of the APOE4 phenotype, including key blood biomarkers and brain regions strongly associated with APOE4.
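Stage 1's cross-modal alignment is usually realised with a symmetric contrastive loss; here is a numpy sketch of the standard InfoNCE form (one plausible instantiation; APOEFormer's exact objective may differ):

```python
import numpy as np

def _log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def infonce_alignment_loss(za, zb, temperature=0.1):
    """Symmetric contrastive alignment of two modality encoders.

    za, zb: (batch, dim) embeddings; row i of each comes from the same
    individual, so the matched pair sits on the diagonal.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    loss_ab = -np.mean(np.diag(_log_softmax(logits)))
    loss_ba = -np.mean(np.diag(_log_softmax(logits.T)))
    return 0.5 * (loss_ab + loss_ba)
```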
Wang, S.; Ayubcha, C.; Hua, Y.; Beam, A.
Background: Developing generalizable neuroimaging models is often hindered by limited labeled data, which has led to an increased interest in unsupervised inverse learning. Existing approaches often neglect geometric principles and struggle with diverse pathologies. We propose a symmetry-informed inverse learning foundation model to address these shortcomings for robust and efficient anomaly detection in brain MRI. Methods: Our framework employs a reconstruction-to-embedding pipeline, trained exclusively on healthy brain MRI slices. A 2D U-Net uses a novel, symmetry-aware masking strategy to reconstruct a disorder-free slice. Difference maps are embedded into a 1024-dimensional latent space via a Beta-VAE. Anomaly scoring is performed using Mahalanobis distance. We evaluated generalization by fine-tuning on external lesion datasets, BraTS Africa (SSA), and the ADNI-derived Alzheimer disease cohort (Alz). Results: On the source metastasis (Mets) dataset, the framework achieved high performance (AB1+MSE: 99.28% accuracy, 99.79% sensitivity). Generalization to the external lesion dataset (SSA) was robust, with the Symmetry ROC configuration achieving 91.93% accuracy. Transfer to the Alzheimer dataset (Alz) was more challenging, achieving a peak accuracy of 70.54% with a high false-positive rate, suggesting difficulty in separating subtle, diffuse changes. Conclusion: The symmetry-informed inverse learning framework establishes a robust foundation model for neuroimaging, showing strong performance for focal lesions and successful generalization under domain shift. Limitations in diffuse neurodegeneration underscore the necessity for richer representations and multimodal integration to improve future foundation models.
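The scoring stage is fully standard; here is a sketch, assuming healthy-only Beta-VAE embeddings and a ridge-regularised covariance (the regularisation term is an added assumption):

```python
import numpy as np

def fit_healthy_gaussian(Z, ridge=1e-3):
    """Mean and precision of healthy-only latent embeddings
    Z of shape (n_healthy, 1024)."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False) + ridge * np.eye(Z.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_scores(Z_test, mu, precision):
    """Anomaly score per embedding: Mahalanobis distance from the
    healthy distribution; larger scores flag likely pathology."""
    diff = Z_test - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))
```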
Dey, S. K.; Qureshi, A. I.; Shyu, C.-R.
Target trial emulation (TTE) enables causal inference from observational data but remains bottlenecked by manual, expert-dependent protocol operationalization. While large language models (LLMs) have advanced clinical knowledge extraction and code generation, their ability to automate end-to-end TTE workflows remains largely unexplored. We present an LLM-driven framework using retrieval-augmented generation to extract the five core TTE design parameters from the Carotid Revascularization and Medical Management for Asymptomatic Carotid Stenosis Trial (CREST-2) protocol and generate executable phenotyping pipelines for real-world EHR data. The performance of the framework was evaluated along two dimensions. First, protocol extraction accuracy was assessed against a gold-standard checklist of trial design components using precision, recall, and F1-score metrics. Second, outcome validity was evaluated through population-level concordance analyses comparing EHR-derived outcomes with published trial endpoints using standardized mean difference, observed-to-expected ratios, confidence interval overlap, and two-proportion z-tests. Further, human-in-the-loop validation assessed the correctness of extracted clinical logic and phenotype definitions. Together, these evaluations demonstrate a structured approach for assessing LLM-driven protocol-to-pipeline translation for scalable real-world evidence generation.
Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.
Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extractions, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
Asplin, P.; Mancy, R.; Keeling, M. J.; Hill, E. M.
Symptom propagation occurs when the symptoms of secondary cases are related to those of the primary case as a result of epidemiological mechanisms. Determining whether - and to what extent - symptom propagation occurs requires data-driven methods. Here we quantify the strength of symptom propagation as the increase in risk of a secondary case developing severe symptoms if the primary case has severe symptoms. We first used synthetic results to determine the data requirements to robustly estimate the strength of symptom propagation and to investigate the effect of severity-dependent reporting bias. Categorising symptom severity into two groups (mild or severe; asymptomatic or symptomatic), our estimation requires only four summary statistics - the number of primary-secondary case pairs for each combination of symptom presentations. Our analysis showed that a relatively small number (100) of synthetic primary-secondary case pairs was sufficient to obtain a reasonable estimate of the strength of symptom propagation, and 1,000 pairs meant errors were consistently small across replicates. Our estimates were robust to severity-dependent reporting bias. We also explored how symptom propagation can be separated from other individual-level factors affecting severity, using age dependence as an example. Although synthetic data generated from an age-structured model led to overestimations of the strength of symptom propagation, allowing disease severity to be age-dependent restored the accuracy of parameter estimation. Finally, we applied our methodology to estimate the strength of symptom propagation from three publicly available datasets collected during the COVID-19 pandemic with data on presence or absence of symptoms: England households, Israel households, and Norway contact tracing. Our age-free methodology indicated a 12-17% increase in the risk of being symptomatic if infected by someone symptomatic. Our positive estimates for the strength of symptom propagation persisted when applying our age-dependent methodology to the two household data sets with age-structured information (England and Israel). These findings demonstrate evidence for symptom propagation of SARS-CoV-2 and provide consistent estimates for its strength. Our synthetic data analysis supports the conclusion that these correlations are not a result of reporting bias or age-dependent effects. This work provides a practical tool for estimating the strength of symptom propagation that has minimal data requirements, enabling application across a wide range of pathogens and epidemiological settings.
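Because the estimator needs only the four pair counts, the headline quantity can be sketched in a few lines; the moment-style relative risk below is an illustrative reduction of the authors' likelihood-based estimation:

```python
def propagation_strength(n_ss, n_sm, n_ms, n_mm):
    """Relative risk of a severe (or symptomatic) secondary case given
    a severe vs. mild primary case.

    Counts are primary-secondary pairs: first index = primary severity,
    second = secondary (s = severe/symptomatic, m = mild/asymptomatic).
    A value above 1 indicates symptom propagation; the reported 12-17%
    increases correspond to values around 1.12-1.17.
    """
    p_sev_given_sev = n_ss / (n_ss + n_sm)
    p_sev_given_mild = n_ms / (n_ms + n_mm)
    return p_sev_given_sev / p_sev_given_mild
```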
Wang, X.-Y.; Li, M.-M.; Zhao, S.-M.; Jia, X.-Y.; Yang, W.-S.; Chang, L.-L.; Wang, H.-M.; Zhao, J.-T.
Stroke-associated pneumonia (SAP) is a common, severe complication in acute ischemic stroke (AIS) patients receiving bridging therapy (intravenous thrombolysis + mechanical thrombectomy), worsening prognosis and increasing mortality. Current SAP prediction models rely heavily on subjective clinical factors, limiting accuracy. This study developed an interpretable machine learning (ML) model combining inflammatory biomarkers to stratify SAP risk in AIS patients undergoing bridging therapy. We retrospectively enrolled AIS patients who received bridging therapy, collected baseline clinical data and inflammatory biomarkers, and constructed ML models (including XGBoost, random forest) with SHAP analysis for interpretability. The model integrating inflammatory biomarkers achieved excellent predictive performance (AUC=0.XX, 95%CI: XX-XX), outperforming traditional clinical models. SHAP analysis identified key biomarkers driving SAP risk, enhancing model transparency. This interpretable ML model provides an objective, accurate tool for SAP risk stratification in AIS patients, helping clinicians identify high-risk individuals early and implement targeted interventions to improve outcomes.
Undurraga Lucero, J. A.; Chesnaye, M.; Simpson, D.; Laugesen, S.
Objective detection of evoked potentials (EPs) is central to digital diagnostics in hearing assessment and clinical neurophysiology, yet current approaches remain time-intensive and sensitive to inter-individual noise variability. Many existing detection methods rely on population-based assumptions or computationally demanding procedures, limiting robustness and efficiency in real-world clinical settings. We present Fmpi, a digital EP detection framework enabling individualised, real-time response detection through analytical modelling of the spectral colour and temporal dynamics of background noise within each recording. Using extensive simulations and large-scale human electroencephalography datasets spanning brainstem, steady-state, and cortical EPs recorded in adults and infants, we demonstrate performance comparable or superior to state-of-the-art bootstrapped methods while operating at a fraction of the computational cost and maintaining well-controlled sensitivity with improved specificity. Importantly, Fmpi incorporates a futility detection mechanism enabling early termination of uninformative recordings, reducing testing time without compromising diagnostic reliability.